Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore freestanding ellipses #235

Closed
wants to merge 2 commits into from
Closed

Ignore freestanding ellipses #235

wants to merge 2 commits into from

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Aug 23, 2024

Fixes sillsdev/serval#458.

I think this is the simplest solution. Let me know what you think.


This change is Reviewable

@Enkidu93 Enkidu93 requested review from ddaspit and johnml1135 August 23, 2024 17:02
@johnml1135
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/UsfmMemoryTextTests.cs line 217 at r1 (raw file):

            @"\id MAT - Test
\c 1
\v 1 Verse text ... More text

What about things like this:

\v 1 ...
\q1 ...
\q2 ...
\v2 ... \f footnote \f*

I'm probably doing it wrong, but what is the defined behavior if the ellipses is between other USFM markers?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 241 at r1 (raw file):

                {
                    text = "";
                }

Should this happen before tokenization? I'm not quite understanding.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 174 at r1 (raw file):

        {
            // if we hit text in a verse paragraph and we aren't in a verse, then start a non-verse segment
            if (text.Trim().Length > 0 && text.Trim() != "...")

This seems like the wrong place to change it - this was added to solve a different issue. Does this change cause the tests to fail? It likely should...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 241 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Should this happen before tokenization? I'm not quite understanding.

This may work - but I don't understand this part of the parse to know where all the places that it will work. I also don't know how paratext populates \q1, and other poetry markers - nor to I know how this may handle populating section headers etc. with ... if that is an issue.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To respond to all your comments in one place, @johnml1135, my goal was to treat verse's whose only text is '...' the same as white-space-only verses. This might be a hacky solution, but I'm not sure what the less hacky solution would be. I could certainly define a regex or something if we're interested in potentially treating other text as white space (e.g., only punctuation or something). Or maybe there's a way to think of this as a special kind of usfm token, but I also don't want to go overboard just for this small issue.

Judging from the project I was working with in which I originally saw this issue, it appears that '...' only appears as filler for an entire verse text - i.e., I didn't see examples of multiple poetic lines (or something similar) each with '...', but I'm not sure if those might exist as well since, like I said, I'm just basing this off one project.

Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are trying to handle the specific case where Paratext inserts ellipses to fill an empty verse. I think you should narrow the check even further and only apply it when you are creating text rows for a verse, i.e. UsfmTextBase.EndVerseText. I want to avoid incorrectly interpreting the intentional use of ellipses as empty text.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)

@johnml1135
Copy link
Collaborator

I agree that changing the code there would prevent partial matches. @ddaspit - do you know all the instances in which the ellipsis might be added in Paratext?

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this, @ddaspit ?

Yeah, it would be nice to know what Paratext's behavior is such that we can make appropriately narrow and also so that we can have reasonable tests.

Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit)

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.81%. Comparing base (1220953) to head (0d743a7).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #235   +/-   ##
=======================================
  Coverage   69.80%   69.81%           
=======================================
  Files         379      379           
  Lines       31478    31482    +4     
  Branches     4399     4400    +1     
=======================================
+ Hits        21974    21978    +4     
  Misses       8483     8483           
  Partials     1021     1021           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmTextBase.cs line 276 at r2 (raw file):

            protected override void EndVerseText(UsfmParserState state, IReadOnlyList<ScriptureRef> scriptureRefs)
            {
                string text = _rowTexts.Pop().ToString();

I think a simple comment would be appropriate here - what are we doing and why? "Paratext sometimes auto-adds an ellipses for newly instantiated verses. This should be ignored for NLP tasks." or similar.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check on the Paratext behavior and get back to you.

Reviewed 1 of 2 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't figure out where the ellipses are coming from. I thought they were something that Paratext would insert in some cases, but I can't find any evidence of that. They might be coming from Adapt It, but I'm not sure how the ellipses would get back into the source USFM from Adapt It. If this isn't standard Paratext behavior, we shouldn't handle this in Machine. Instead, we should handle this during preprocessing in Serval.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can ask Michael about it tomorrow (?), @ddaspit, before committing to that switch. He seemed fairly sure that Paratext created the ellipses when the team made an empty book and my impression was that this was typical.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke to Michael about it. He doesn't know where they come from. I actually looked at the Paratext repo and found no trace of any code that adds ellipses. I looked at a couple of projects that contain them and can't find a consistent pattern.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so what's the plan here. Do you want to relocate this logic? What do you mean by preprocessing in Serval? There isn't really any file preprocessing is there, right?

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to filter out any segments that are ... when generating the training corpus and pretranslation JSON file in PreprocessBuildJob.WriteDataFilesAsync.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just making sure I wasn't missing something 😆

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

Moving filtration to serval: sillsdev/serval#469

@Enkidu93 Enkidu93 closed this Aug 27, 2024
@Enkidu93 Enkidu93 deleted the ignore_ellipses branch August 27, 2024 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suppress "..." verse text during preprocessing of the model training data
4 participants